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Abstract 

Template  matching  by  means  of  cross-correlation  is  common  practice  in  pattern  recognition.  However, 
its  sensitivity  to  deformations  of  the  pattern  and  the  broad  and  unsharp  peaks  it  produces  are  signifi¬ 
cant  drawbacks.  This  paper  reviews  some  results  on  how  these  shortcomings  can  be  removed.  Several 
techniques  (Matched  Spatial  Filters,  Synthetic  Discriminant  Functions,  Principal  Components  Projections 
and  Reconstruction  Residuals)  are  reviewed  and  compared  on  a  common  task:  locating  eyes  in  a  database 
of  faces.  New  variants  are  also  proposed  and  compared:  least  squares  Discriminant  Functions  and  the 
combined  use  of  projections  on  eigenfunctions  and  the  corresponding  reconstruction  residuals.  Finally, 
approximation  networks  are  introduced  in  an  attempt  to  improve  filter  design  by  the  introduction  of 
nonlinearity. 
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1  Introduction 

The  detection  and  recognition  of  objects  from  their  im¬ 
ages,  irrespective  of  their  orientation,  scale,  and  view, 
is  a  very  important  research  subject  in  computer  vision, 
if  not  computer  vision  itself.  Several  techniques  have 
been  proposed  in  the  past  to  solve  this  challenging  prob¬ 
lem,  In  this  paper  we  will  focus  on  a  subset  of  these 
techniques,  those  employing  the  idea  of  projection  to 
match  image  patterns.  The  notion  of  Matched  Spatial 
Filter  (hereafter  MSF)  is  a  venerable  one  with  a  long 
history  [21].  While  by  itself  it  cannot  account  for  invari¬ 
ant  recognition,  it  can  be  coupled  to  invariant  mappings 
or  signal  expansions,  and  is  therefore  able  to  provide 
invariance  to  rotation  and  scaling  in  the  image  plane. 
In  order  to  cope  with  more  general  variations  of  the  ob¬ 
jects  views  more  sophisticated  approaches  have  to  be  em¬ 
ployed.  Among  them,  the  use  of  Synthetic  Discriminant 
Functions  [17,  9,  6,  14,  15,  20,  28,  19,  8,  7,  26,  18,  16]  is 
one  of  the  more  promising  so  far  developed.  In  these 
paper  we  will  follow  a  path  from  MSF,  to  expansion 
matching  through  different  variant  of  SDFs.  The  first 
section  describes  the  basic  properties  of  MSF,  their  op¬ 
timality  and  their  relation  to  the  probability  of  misclas- 
sifi cation.  The  generalization  of  MSF  to  a  linear  com¬ 
bination  of  example  images  is  introduced  next.  Several 
shortcomings  of  the  basic  approach  are  outlined  and  a 
set  of  possible  solutions  is  presented  in  the  subsequent 
section.  We  discuss  a  relation  of  the  resulting  class  of  fil¬ 
ters  to  nonorthogonal  image  expansion.  A  generalization 
to  projections  on  multiple  directions  and  the  use  of  the 
projection  residual  for  pattern  matching  is  then  investi¬ 
gated  [24,  22,  27,  29,  30,  31].  Finally,  a  more  powerful, 
non  linear  framework  is  introduced  in  which  template 
matching  can  be  looked  at  as  a  problem  of  function  ap¬ 
proximation.  Network  architectures  and  training  strate¬ 
gies  are  proposed  within  this  new  general  framework. 

2  Matched  Spatial  Filter 

Template  matching  is  extensively  used  in  low-level  vision 
tasks  to  localize  and  identify  patterns  in  images.  Two 
methods  are  commonly  used: 

1.  image  subtraction:  images  are  considered  as  vec¬ 
tors  and  the  norm  of  their  difference  is  considered 
as  a  measure  of  dissimilarity; 

2,  correlation:  the  dot  product  of  two  images  is  con¬ 
sidered  as  a  measure  of  their  similarity  (it  repre¬ 
sents  the  angle  between  the  images  when  they  are 
suitably  normalized  and  considered  as  vectors). 

When  the  images  are  normalized  to  have  zero  average 
and  unit  norm,  the  two  approaches  give  the  same  re¬ 
sult.  The  usual  implementation  of  the  above  methods 
relies  on  the  euclidean  distance.  Other  distances  can 
be  used  and  some  of  them  have  better  properties  such 
as  increased  robustness  to  noise  and  minor  deformations 
[4].  The  next  sections  are  mainly  concerned  with  the 
correlation  approach.  The  idea  of  image  subtraction  is 
introduced  again  in  the  more  general  nonlinear  frame¬ 
work. 


2.1  Optimality 

One  of  the  reasons  for  which  template  matching  by  corre¬ 
lation  is  commonly  used  is  that  correlation  can  be  shown 
to  be  the  optimal  (according  to  a  particular  criterion) 
linear  operation  by  which  a  deterministic  reference  func¬ 
tion  can  be  extracted  from  additive  white  Gaussian  noise 
[21].  Let  the  detected  signal  be: 

g{x)  -  (j){x)  +  X{x)  (1) 

where  <l>{x)  is  the  original  signal  and  A(a;)  is  noise  with 
power  spectrum  S{u)).  The  noise  is  assumed  to  be  wide- 
sense  stationary  with  zero  average  so  that: 

E{X{x)}  =  0 
E{X{x-\-a)X{x)}  =  R{a) 

We  assume  that  (f){x)  is  known  and  we  want  to  establish 
its  presence  and  location.  To  do  so  we  apply  to  the 
process  g{x)  a  linear  filter  with  impulse  response  h{x) 
and  system  function  H{lo).  The  resulting  output  is: 

/CX) 

g{x  —  a)h{a)da  (2) 

-oo 

=  (3) 

Using  the  convolution  theorem  for  the  Fourier  transform 
we  have  that: 


z^(x) 


X  —  a)h{a)da 


^oj)H{Lo)e^^^duj 


We  want  to  find  H(u;)  so  as  to  maximize  the  following 
signal  to  noise  ratio  (SNR): 


E{z\{xo)} 


(4) 


where  Xo  is  the  location  of  the  signal.  The  SNR  repre¬ 
sents  the  ratio  of  the  filter  responses  at  the  uncorrupted 
signal  and  at  the  noise.  It  is  defined  at  the  true  location 
of  the  signal  (usually  the  correlation  peak)  therefore  not 
taking  into  account  the  off-peak  response  of  the  filter. 
Two  cases  of  particular  interest  are  those  of  white  and 
colored  noise: 

White  Noise  This  type  of  noise  is  defined  by  the 
following  condition 


5(a;)  =  5o 

which  corresponds  to  a  flat  energy  spectrum.  The 
Schwartz  inequality  states  that 

I  /  f{x)g{x)dx\^  <  f  |/(a;)p  /  |f?(x)|2 

and  the  equality  hold  iff  f{x)  =  kg(x)  (we  use  ~  to  rep¬ 
resent  complex  conjugation).  This  implies  the  following 
bound  for  the  signal  to  noise  ratio  r: 

2TrSo  f  \H{u)\^duj 


1 


and  then 


where 

represents  the  energy  of  the  signal.  From  the  Schwartz 
inequality  the  equality  holds  only  if 


variance.  Another  useful  normalization  is  to  set  the  av¬ 
erage  value  of  the  vector  coordinates  to  zero.  This  oper¬ 
ation  corresponds  to  setting  the  average  of  the  intensity 
distribution  for  images.  These  normalization  are  par¬ 
ticularly  useful  when  modern  cameras  are  used,  as  they 
usually  operate  with  automatic  gain  level  (acting  on  the 
scale  of  the  intensity)  and  black  level  adjustment  (acting 
as  on  offset  on  the  intensity  distribution). 


H(u))  =  jfc^(a;)e“'*^^° 

The  spatial  domain  version  of  the  filter  is  simply  the 
mirror  image  of  the  signal: 

h(x)  =  k<j)(xQ  —  x) 

which  implies  that  the  convolution  of  the  signal  with  the 
filter  can  be  expressed  as  the  cross- correlation  with  the 
signal  (hence  the  name  Matched  Spatial  Filter). 

Colored  Noise  If  the  noise  has  a  non  flat  spectrum 
S{u))  it  is  said  to  be  colored.  In  this  case  the  following 
holds: 


2TrZtp{xo)  = 

|27rz^(a;o)P  = 


< 


/ 


s{^) 


-du) 


duj 


hence 


|^(a;)e*^^|2 

S{^) 


duj 


with  equality  holding  only  when 


The  main  consequence  of  the  color  of  noise  is  that  the 
optimal  filter  corresponds  to  a  modified  version  of  the 
signal 


H{uj)  =  k 


which  emphasizes  the  frequencies  where  the  energy  of  the 
noise  is  smaller.  The  optimal  filter  can  also  be  considered 
as  a  cascade  of  a  whitening  filter  5”^/^ (a;)  and  the  usual 
filter  based  on  the  transformed  signal. 

In  the  spatial  domain,  correlation  amounts  to  project¬ 
ing  the  signal  g(x)  onto  the  available  template  <l){x).  If 
the  norm  of  the  projected  signal  is  not  equal  to  that  of 
the  template,  the  value  of  the  projection  can  be  mean¬ 
ingless  as  the  projection  value  can  be  large  without  im¬ 
plying  that  the  two  vectors  are  close  in  any  reasonable 
sense.  The  solution  is  to  compute  the  projection  using 
normalized  vectors.  In  particular,  if  versors  are  used, 
computing  the  projection  amounts  to  computing  the  co¬ 
sine  of  the  angle  formed  by  the  two  vectors,  which  is 
an  effective  measure  of  similarity.  In  vision  tasks,  vec¬ 
tor  normalization  corresponds  to  adjusting  the  intensity 
scale  so  that  the  corresponding  distribution  has  a  given 


2.2  Distorted  Templates 

The  previous  analysis  was  focused  on  the  detection  of  a 
deterministic  signal  corrupted  by  noise.  An  interesting 
extension  is  the  detection  of  a  signal  belonging  to  a  given 
distribution  of  signals  [17].  As  an  example,  consider  the 
problem  of  locating  the  eyes  in  a  face  image.  We  do 
not  know  who’s  face  it  is  so  that  we  cannot  select  the 
corresponding  signal  (the  eyes  of  that  person),  A  whole 
set  of  different  eyes  could  be  available,  possibly  including 
the  correct  ones. 

Let  {<l>{x)]  denote  the  class  of  signals  to  be  detected. 
We  want  to  find  the  filter  h  which  maximizes  the  SNR 

over  the  class  of  signals  {(j)(x)}.  The  input  signal  <j)(x) 
can  be  modeled  as  a  sample  realization  of  the  stochastic 
process  {<^(x)}.  The  ensemble- average  correlation  func¬ 
tion  of  the  stochastic  process  is  defined  by 


K^^{x,  y)  =  E^{4>{x)4>{y)}  (5) 

and  represents  the  average  over  the  ensemble  of  signals 
(and  not  over  the  coordinates  of  a  signal).  What  we 
want  to  maximize  is  the  ensemble  average  of  the  signal 
to  noise  ratio: 


,  E{\z^{xo)\^} 

^  -  E{z-^x{xo)} 

Assume,  without  loss  of  generality,  that  a?o  =  0.  The 
average  SNR  can  then  be  rewritten  as: 

,  21  ^  S !K-^)K-y)KH{^,y)dxdy 
^  J  S  h{-x)h{-y)Kx\{x,y)dxdy 

where  the  ensemble  autocorrelation  function  of  the  sig¬ 
nal  and  noise  have  been  used.  The  autocorrelation  func¬ 
tion  of  the  white  noise  is  proportional  to  a  Dirac  delta 
function: 

Kx\{x,  y)  =  N6{x  -  y)  (8) 

SO  that  the  average  signal  to  noise  ratio  can  be  rewritten 
as: 


r_2i .  ! !K-^)K-y)KH{^^y)dxdy 
^  - - N!h(-xYdx - 

Pre- whitening  operators  can  be  applied  as  preprocessing 
functions  when  the  assumption  of  white  noise  does  not 
hold.  The  denominator  of  the  RHS  in  eqn.  (9)  represents 
the  energy  of  the  filter  and  we  can  require  it  to  be  1 : 


dx  =  \ 


(10) 


To  optimize  eqn.  (9)  we  must  maximize  the  numerator 
subject  to  the  energy  constraint  of  the  filter.  The  ensem¬ 
ble  auto- correlation  function  can  be  expressed  in  terms 


Figure  1:  The  probability  of  error,  represented  by  the 
shaded  area,  when  the  distributions  are  Gaussian  with 
the  same  covariance. 


of  the  orthonormal  eigenfunctions  of  the  integral  kernel 
K^^{x,y) 

2/)  =  ^  A,V’,(a:)i/’i(2/)  (H) 

i 

where  the  A*  are  the  corresponding  eigenvalues.  The 
filter  function  h  can  also  be  expanded  in  the  same  basis 

(12) 

i 

Using  the  inner  product  notation  and  the  orthonormality 
of  the  ^pi{x)  we  can  state  the  optimization  problem  as 
finding 

arg  max  Xi{h  •  (13) 

If  we  order  the  eigenvalues  so  that  Ai  >A2>  ...>Ajb  > 

. . ,,  we  have 

N-E^ir"^}  =  =  Ai 

i  i  i 

and  the  maximum  value  is  achieved  when  the  filter  func¬ 
tion  is  taken  to  be  the  dominant  eigenvector. 

2.3  Signal-to-noise  ratio  and  classification  error 

Several  performance  metrics  are  available  for  correlation 
filters  that  describe  attributes  of  the  correlation  plane. 
The  signal  to  noise  ratio  (SNR)  is  just  one  of  them. 
Other  useful  quantities  are  the  peak- to- correlation  en¬ 
ergy,  the  location  of  the  correlation  peak  and  the  in¬ 
variance  to  distortion.  As  correlation  is  typically  used 
to  locate  and  discriminate  objects,  another  important 
measure  of  a  filter’s  performance  is  how  well  it  discrim¬ 
inates  between  different  classes  of  objects.  The  simplest 
case  is  given  by  the  discrimination  between  the  signal 
and  the  noise.  In  this  section  we  will  show  [16,  12]  that 
for  the  classical  matched  filter  maximizing  the  SNR  is 
equivalent  to  minimizing  the  probability  of  classification 
error  Pg  when  the  underlying  probability  distribution 
functions  (PDFs)  are  Gaussians. 

The  classifier  which  minimizes  the  probability  of  er¬ 
ror  is  the  Bayes  classifier.  For  two  normal  distributions. 


Figure  2:  The  cross- correlation  of  the  template  reported 
on  the  right.  Note  the  diffuse  shape  of  the  peak  that 
makes  its  localization  difficult 


the  Bayes  decision  rule  can  be  expressed  as  a  quadratic 
function  of  the  observation  vector  x  as 


^(a;  -  mAf'E/{x  -  m^) 
-  ms) 


A 

M  > 


+ 


(15) 


where  rriAjms  are  the  distribution  means,  the 

covariance  matrices  and  Pa  ,  Pb  the  occurence  probabil¬ 
ities. 

Let  us  consider  two  classes:  a  deterministic  signal 
corrupted  with  white  Gaussian  noise  as  class  A  and  the 
noise  itself  as  class  B.  In  this  case  ttia  “  =  0 

and  =  Eb  =7.  This  means  that  the  components 
of  the  signal  are  uncorrelated  and  have  unit  variance. 
If  we  further  assume  that  the  a  priori  probabilities  of 
occurence  of  these  classes  are  equal,  the  probability  of 
error  (see  also  Figure  1)  is  given  by: 

ex  p{—u^/2)du  (16) 

where  t]  =  with  ^  being  the  Mahalanobis  distance 

between  the  PDFs  of  the  two  classes: 

^  =  {niA  -  msflimA  -  ms)  =  (17) 

and  the  Bayes  decision  rule  simplifies  to: 

a;  G  A  if  x  >  (18) 

aj  G  P  if  4^^x  <  (19) 

The  input  vector  x  is  then  classified  as  signal  or  noise 
depending  on  the  value  of  the  correlation  with  the  uncor¬ 
rupted  signal.  We  have  already  shown  that  correlation 
with  the  signal  maximizes  the  signal  to  noise  ratio,  so 
when  the  noise  distribution  is  Gaussian  maximizing  the 
SNR  is  equivalent  to  minimizing  the  classification  error 
probability.  When  the  noise  is  not  white,  the  signal  can 
be  transformed  by  applying  a  whitening  transformation 
A: 

A^EA  =  I  (20) 

and  the  previous  reasoning  can  be  applied. 


Generalized  Matched  Filters 


3  Synthetic  Discriminant  Functions 


While  correlators  are  optimal  for  the  recognition  of  pat¬ 
terns  in  the  presence  of  white  noise  they  have  three  major 
limitations:  the  output  of  the  correlation  peak  degrades 
rapidly  with  geometric  image  distortions,  the  peak  is  of¬ 
ten  broad  (see  Figure  2),  making  its  detection  difficult, 
and  they  cannot  be  used  for  multiclass  pattern  recog¬ 
nition.  It  has  been  noted  that  one  can  obtain  better 
performance  from  a  multiple  correlator  (i.e,  one  com¬ 
puting  the  correlation  with  several  templates)  by  form¬ 
ing  a  linear  combination  of  the  resulting  outputs  instead 
of,  for  example,  taking  the  maximum  value  [10,  11]. 
The  filter  synthesis  technique  known  as  Synthetic  Dis¬ 
criminant  Functions  (hereafter  SDF)  starts  from  this 
observation  and  builds  a  filter  as  a  linear  combination 
of  MSFs  for  different  patterns  [9,  6].  The  coefficients 
of  the  linear  combination  are  chosen  to  satisfy  a  set  of 
constraints  on  the  filter  output,  requiring  a  given  value 
for  each  of  the  patterns  used  in  the  filter  synthesis.  By 
forcing  the  filter  output  to  different  values  for  different 
patterns,  multiclass  pattern  recognition  can  be  achieved. 
Let  be  a  set  of  (linearly  independent)  im¬ 

ages  and  u  =  {ui, . . . ,  be  a  vector  representing  the 
required  output  of  the  filter  for  each  of  the  images: 

=  Ui  (21) 

where  0  represents  correlation  (not  convolution).  The 
filter  h  can  be  expressed  as  a  linear  combination  of  the 
images  (j>i\ 

hix)=  (22) 

as  any  additional  contribution  from  the  space  orthogo¬ 
nal  to  the  images  would  yield  a  zero  contribution  when 
correlating  with  the  image  set.  If  we  denote  by  X  the 
matrix  whose  columns  represent  the  images  (represented 
as  vectors  by  concatenating  their  rows),  by  enforcing  the 
constraints  we  obtain  the  following  set  of  equations: 

b={X^X)-^u  (23) 

which  can  be  solved  as  the  images  are  linearly  inde¬ 
pendent.  The  resulting  filter  is  appropriate  for  pattern 
recognition  applications  in  which  the  input  object  can 
be  a  member  of  several  classes  and  different  distorted 
versions  of  the  same  object  (or  different  objects)  can  be 
expected  within  each  class.  If  M  is  the  number  of  classes, 
Tii  is  the  number  of  different  pattern  within  each  class  z, 
N  the  overall  number  of  patterns,  M  filters  can  be  built 
by  solving 

bi  =  {X^xy^Si  (24) 

where 


^ik  = 


{ 


1  nj  <k<  rij 

0  otherwise 


(25) 


k  =  1, . . . ,  iV  and  image  (j)^  belongs  to  class  i  if  Sik  =  1. 
Discrimination  of  different  classes  can  be  obtained  also 
using  a  single  filter  and  imposing  different  output  values. 
However  the  performance  of  such  a  filter  is  expected  to 
be  inferior  to  that  of  a  set  of  class  specific  filters  due 


Training  images 


□  SDF 
O  Ls  SDF 
^  Average 
0  Maximum 


Figure  3:  An  increasing  portion  of  a  set  of  30  eyes  images 
was  used  to  build  a  SDF,  an  average  MSF  or  a  set  of 
prototype  MSFs  from  which  the  highest  response  was 
extracted.  Our  new  least  square  SDF  uses  four  building 
templates.  The  plot  report  the  average  responses  over 
a  disjoint  30  image  test  set.  Note  that  the  lower  values 
of  MSFs  are  due  to  the  fact  that  their  response  is  not 
scaled  to  obtained  a  predefined  value  as  opposed  to  SDFs 
whose  output  is  constrained  to  be  1,  and  to  approximate 
1  for  Is  SDFs. 


to  the  high  number  of  constraints  imposed  on  the  filter 
outputs  [9].  While  this  approach  makes  it  easy  to  ob¬ 
tain  predefined  values  on  a  given  set  of  patterns  it  does 
not  allow  to  control  the  off-peak  filter  response.  This 
can  prevent  reliable  classification  when  the  number  of 
constraints  becomes  large. 

The  effect  of  filter  clutter  can  also  appear  in  the  con¬ 
struction  of  a  filter  giving  a  fixed  response  over  a  set  of 
images  belonging  to  the  same  class  (the  Equal  Correla¬ 
tion  Filter  introduced  in  [9]). 

In  order  to  minimize  this  problem  we  propose  a  new 
variant  of  SDFs:  least  quares  SDFs.  These  filters  are 
computed  using  only  a  subset  of  the  training  images^ 
and  the  coefficient  of  the  linear  combination  are  chosen 
to  minimize  the  square  error  of  the  filter  output  on  all  of 
the  available  images.  In  this  case  the  matrix  R  =  X^X 
is  rectangular  and  the  estimate  of  the  6  relies  on  the 
computation  of  the  pseudoinverse  of  R: 

R)  =  (R^R)-^R^  (26) 

The  dimension  of  the  matrix  to  be  inverted  is  nxn  where 
n  represents  the  number  of  images  used  to  build  the  fil¬ 
ter  and  not  the  (greater)  number  of  training  images.  By 
using  a  reduced  number  of  building  templates  the  prob¬ 
lem  of  filter  cluttering  is  reduced.  A  different  use  of  least 
square  estimation  for  filter  synthesis  can  be  found  in  [6] 
where  it  is  coupled  to  Karhunen-Loeve  expansion  for  the 
construction  of  correlation  SDFs. 

^The  subset  of  training  images  can  be  chosen  in  a  variety 
of  ways.  In  the  reported  experiments  they  were  chosen  at 
random.  Another  possibility  is  that  of  clustering  the  available 
images,  the  number  of  clusters  being  equal  to  the  number  of 
images  used  in  filter  synthesis. 


(28) 


Figure  4:  The  MSFs  resulting  from  using  20  building 
images  in  the  SDF  (left)  and  2  in  the  least  square  SDF 
(right)  when  using  the  same  set  of  training  images.  The 
difference  in  contrast  of  the  two  images  reflect  the  mag¬ 
nitude  of  the  MSFs.  The  performance  of  the  two  filters 
was  similar. 

The  results  for  a  sample  application  are  reported  in 
Figure  3.  Note  that  by  using  a  least  square  estimate  a 
good  performance  can  be  achieved  using  a  small  number 
of  templates.  This  has  a  major  influence  on  the  appear¬ 
ance  of  the  resulting  MSF  as  can  be  seen  in  Figure  4. 

Another  variant  is  to  use  symbolic  encoded  filters  [9]. 
In  this  case  a  set  of  k  filters  is  built  whose  outputs  are 
0  or  1  and  can  be  used  to  encode  the  different  patterns 
using  a  binary  code.  In  order  to  use  the  filter  for  clas¬ 
sification,  the  outputs  are  thresholded  and  the  resulting 
binary  number  is  used  to  index  the  pattern  class. 

Synthesis  of  the  MSF  from  a  projection  SDF  algo¬ 
rithm  can  achieve  distortion  invariance  and  retain  shift 
invariance.  However,  the  resulting  filter  cannot  prevent 
large  sidelobe  levels  from  occurring  in  the  correlation 
plane  for  the  case  of  false  (or  true)  targets.  The  next 
section  will  detail  the  construction  of  filters  which  guar¬ 
antee  controlled  sharp  peaks  and  good  noise  immunity. 

4  Advanced  SDFs 

The  signal  to  noise  ratio  maximized  by  the  MSF  is  lim¬ 
ited  to  the  correlation  peak:  it  does  not  take  into  account 
the  off-peak  response  and  the  resulting  Alters  often  ex¬ 
hibit  a  sustained  response  well  apart  from  the  location 
of  the  central  peak.  This  effect  is  usually  amplified  in 
the  case  of  SDF  when  many  constraints  are  imposed  on 
the  filter  output.  In  order  to  locate  the  correlation  peak 
reliably,  it  should  be  very  localized  [14].  However,  it  can 
be  expected  that  the  greater  the  localization  of  the  filter 
response  (approaching  a  6  function)  the  more  sensitive 
the  filter  to  slight  deviations  from  the  patterns  used  in 
its  synthesis.  This  suggests  that  the  best  response  of  the 
filter  should  not  really  be  a  ^  function,  but  some  shape, 
like  a  Gaussian,  whose  dispersion  can  be  tuned  to  the 
characteristics  of  the  pattern  space.  In  this  section  we 
will  review  the  synthesis  of  such  filters  in  the  frequency 
domain  [26]. 

Let  us  assume  for  the  moment  that  there  is  no  noise. 
The  correlation  of  the  f-th  pattern  with  the  filter  h  is 
represented  by 

Zi{ri)  -  (j)i{n)  ^  h{ri)  n  =  0,  (27) 


an  output  Ui  for  each  training  image: 

Zi{0)  =  Ui 

which  can  be  rewritten  in  the  Fourier  domain  as: 

H^X  =  du  (29) 

where  denotes  complex  conjugate  transpose. Using 
ParsevaFs  theorem,  the  energy  of  the  i-ih.  circulani  cor¬ 
relation  plane  is  given  by: 

Ei  =  E  \zi{n)\^  =  i  ^  \Zi{k)\''  =  ]"'£  |i?(fc)|2|$,(fe)P 

n=0  “  *=0  k=0 

(30) 

When  the  signal  is  perturbed  with  noise  the  output 
of  the  filter  will  also  be  corrupted: 

Zi{0)  =  (l>i{0)  0  h{Q)  -h  A(0)  0  h{0)  (31) 

Under  the  assumption  of  zero-mean  noise,  the  variance 
of  the  filter  output  due  to  noise  is: 

1 

En  =  ■^'£mk)\^S{k)  (32) 

Jb=0 

where  S{k)  is  the  noise  spectral  energy.  What  we  would 
like  is  a  filter  whose  average  correlation  energy  over  the 
different  training  images  and  noise  is  as  low  as  possible 
while  meeting  the  constraints  on  the  filter  outputs.  A 
first  choice  is  to  minimize: 

E  =  +  ^n)  (33) 

* 

=  +  (34) 

i  k 

subject  to  the  constraints  of  eqn.  (29).  However,  mini¬ 
mizing  the  average  energy  (or  filter  variance  due  to  noise) 
does  not  minimize  each  term,  corresponding  to  a  partic¬ 
ular  correlation  energy  (or  noise  variance).  A  more  strin¬ 
gent  bound  can  be  obtained  by  considering  the  spectral 
envelope  of  the  different  terms  in  eqn.  (34): 

E  =  'Y^  |ff(A;)p  max($i(Ar)p, . . . ,  |#jv(^)P,  S{k))  (35) 

k 

If  we  introduce  the  diagonal 

matrix  Tfcifc  =  iVmax(|$i(fc)P, . . |$jv(fc)P,  ^(fc))  the 
filter  synthesis  can  be  summarized  as  minimizing 

E  =  H^TH  (36) 

subject  to 

H^X  =  du  (37) 

This  problem  can  be  solved  [20]  using  the  technique  of 
Lagrange  multipliers  to  minimize  the  function: 

N 

£  =  H+TH-2J2  A,(H+Xi  -  dui)  (38) 

i=l 


where  d  is  the  dimension  of  the  patterns.  In  the  follow-  where  Ai, . . . ,  are  the  parameters  introduced  to  sat¬ 
ing  capital  letters  are  used  to  denote  the  Fourier  trans-  isfy  the  constrained  minimization.  By  zeroing  the  gra- 

formed  quantities.  The  filter  is  also  required  to  produce  dient  of  S  with  respect  to  H  we  can  express  H  as  a 

0 


Figure  5:  Using  an  increasing  amount  of  added  white 
noise  the  emphasis  given  to  the  high  frequency  is  reduced 
and  the  resulting  filter  response  approaches  that  of  the 
standard  MSF. 


function  of  T  and  of  A  =  {Ai, . . Ajv}.  By  substitution 
into  eqn.  (37)  the  following  solution  is  found: 

H  =  (39) 

The  use  of  the  spectral  envelope  has  the  effect  of  reduc¬ 
ing  the  emphasis  given  by  the  filter  to  the  high  frequency 
content  of  the  signal,  thereby  improving  intraclass  per¬ 
formance.  It  is  important  to  note  that  the  resulting  filter 
can  be  seen  as  a  cascade  of  a  whitening  filter  and  a 

conventional  SDF  based  on  the  transformed  data.  Note 
that  in  this  case  the  whitened  spectrum  is  the  envelope 
of  the  spectra  of  the  real  noise  and  of  the  training  im¬ 
ages.  A  least  square  approach  may  again  be  preferred 
to  cope  with  a  large  number  of  examples.  In  this  case 
all  available  images  are  used  to  estimate  T  but  only  a 
subset  of  them  is  used  to  build  the  corresponding  SDF. 
Experiments  have  been  reported  using  a  white  noise  of 
tunable  energy  a  to  model  the  intraclass  variability  [26] 

|//(fe)|2max(|$i(fc)|2, . . . ,  a)  (40) 

k 

Adding  white  noise  limits  the  emphasis  given  to  high  fre¬ 
quencies,  reducing  the  sharpness  of  the  correlation  peak 
and  increasing  the  tolerance  to  small  variations  of  the 
templates  (see  Figures  5  and  6).  A  comparison  of  dif¬ 
ferent  filters  is  reported  in  Figure  7.  The  effects  of  non 
linear  processing  emphasizing  the  high  frequencies  to  ob¬ 
tain  a  sharp  correlation  peak  is  reported  in  Figure  8. 

Another  way  of  controlling  the  intraclass  performance 
is  that  of  modeling  the  correlation  peak  shape  [8,  18].  As 
already  mentioned,  the  shape  of  the  correlation  peak  is 
expected  to  be  important  both  for  its  detection  and  for 
the  requirements  imposed  on  the  filter  which  can  impair 
its  ability  to  correlate  well  with  patterns  even  slightly  dif¬ 
ferent  from  the  ones  used  in  the  training.  Let  us  denote 
with  F{k)  the  required  shape  of  the  correlation  peak. 
The  shape  of  the  peak  can  be  constrained  by  minimizing 


the  squared  deviations  of  its  output  from  the  required 
shape  F: 


N  d 

=  (41) 

i=l  A;=l 

where,  for  instance,  f{x)  =  exp{—x^l2(T^)  is  a  Gaussian 
amplitude  function.  By  switching  to  matrix  notation, 
the  resulting  energy  can  be  expressed  as; 

Es  =  H^DH  +  F+F  -  il+AF  -  F+A+lf  (42) 

where  A  is  a  diagonal  matrix  whose  elements  are  the 
sum  of  the  components  of  and  Z)  is  a  diagonal  ma¬ 
trix  whose  elements  are  the  sum  of  the  squares  the  com¬ 
ponents  of  The  first  term  in  the  RHS  of  eqn.  (42) 
corresponds  to  the  average  correlation  energy  of  the  dif¬ 
ferent  patterns  (see  eqn.  (30)).  We  suggest  the  use  of 
the  spectral  envelope  T  instead  of  D,  employed  in  the 
original  approach,  thereby  minimizing  the  following  en¬ 
ergy 

E's  =  iZ+rif-hF+F->Fr+AF-F+A+ir  >  Es  (43) 

The  minimization  of  Es  subject  to  the  constraints  of  eqn. 
(29)  can  be  done  again  using  the  Lagrange  multiplier  and 
is  found  to  be: 

H  =  T-^X{X^T~^X)-^du  (44) 

-VT-^AF  -  T-^X(X+T~^X)-^A:+T-^AF 

These  filters  provide  a  controlled,  sharp  correlation  peak 
subject  to  the  constraints  on  the  filter  output,  the  re¬ 
quired  correlation  peak  shape  and  the  reduce  variance 
to  the  noise.  In  our  experiments  the  Fourier  domain 
was  used  to  compute  the  whitening  filters.  They  were 
then  transformed  to  the  spatial  domain  where  a  stan¬ 
dard  correlation  was  computed  after  their  application. 
An  approach  using  only  computations  in  the  space  do¬ 
main  can  be  found  in  [28] . 

5  Nonorthogonal  Image  Expansion  and 
SDF 

In  this  section  we  review  an  alternative  way  of  looking  at 
the  problem  of  obtaining  sharp  correlation  peaks,  namely 
the  use  of  nonorthogonal  image  expansion  [2,  3].  Match¬ 
ing  by  expansion  is  based  on  expanding  the  signal  with 
respect  to  basis  functions  (BFs)  that  are  all  translated 
versions  of  the  template.  Such  an  expansion  is  feasible 
if  the  BFs  are  linearly  independent  and  complete.  It 
can  be  proven  that  self-similar  BFs  of  compact  support 
are  independent  and  complete  under  very  weak  condi¬ 
tions.  Suppose  one  wants  to  estimate  the  discrete  d- 
dimensional  signal  g{x)  by  a  linear  combination  of  basis 
functions  (f>i{x): 


d 

s'i^)  =  E  (45) 

1  =  1 

where  ^i(x)  now  represents  the  i-th  circulated  transla¬ 
tion  of  <j).  The  coefficients  are  estimated  by  minimizing 


Figure  6:  The  output  of  the  correlation  with  an  SDF 
computed  using  the  spectral  envelope  of  10  training  im¬ 
ages  and  different  amounts  of  white  noise  (left:  a  =  1, 
middle  a  =  5)  compared  to  the  output  of  normalized 
cross-correlation  using  one  of  the  images  used  to  build 
the  SDF  but  without  any  spectral  enhancement.  The 
darker  the  image  the  higher  the  corresponding  value. 
Note  that  an  increased  amount  of  white  noise  improves 
the  response  of  the  filter. 


Figure  7:  The  output  of  the  correlation  with  an  SDF 
computed  using  the  spectral  envelope  of  20  training  im¬ 
ages  as  whitening  preprocessing.  Left:  the  normal  SDF 
(20  examples).  Right:  a  least  square  SDF  with  6  tem¬ 
plates  (20  examples).  The  darker  the  image  the  higher 
the  corresponding  value.  The  least  square  SDF  exhibits 
a  sharper  response  using  the  same  whitening  filter. 
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Figure  8:  Non  linear  processing  can  be  employed.  The 
figure  represent  the  result  of  preprocessing  the  image 
to  extract  the  local  image  contrast  (intensity  value  over 
the  average  value  in  a  small  neighborhood).  This  kind 
of  preprocessing  emphasizes  high  frequencies  and  results 
in  a  sharp  correlation  peak. 


the  square  error  of  the  approximation  l^f  —  The  ap¬ 
proximation  error  is  orthogonal  to  the  basis  functions  so 
that  the  following  system  of  equations  must  be  solved: 

d 

>  Cj  =  <g,<f>i>  (46) 

j=l 

d 

X)  <  4>j  >Cj  =  <  g,  > 

If  the  set  of  basis  functions  is  linearly  independent  the 
equations  give  a  unique  solution  for  the  expansion  coef¬ 
ficients.  If  we  consider  the  advanced  SDF  for  the  case  of 
no-noise,  single  training  image  and  working  in  the  spa¬ 
tial  domain  [28],  we  have  that  the  corresponding  filter 
can  be  expressed  as: 

h  =  {[<l>n<t>]V<l>  (47) 

where  the  columns  of  matrix  [•]  are  the  circulated  basis 
functions  <j>i.  The  output  of  the  correlation  is  then  given 
by: 

[(j)]h  =  c  =  (48) 

The  solution  of  the  system  (47)  can  be  expressed  as: 

(49) 

which  is  clearly  the  same.  In  the  case  of  no  noise  the 
resulting  expansion  is  c  =  (0, . . . ,  0, 1,  0, . .  .0)  with  a  sin¬ 
gle  1  at  the  location  of  the  signal.  The  idea  of  expansion 
matching  is  also  closely  related  to  correlation  SDFs  [6] 
where  multiple  shifted  templates  were  explicitly  used  to 
shape  the  correlation  peak.  Let  us  consider  a  set  of  tem¬ 
plates  obtained  by  shifting  the  original  pattern  (possibly 
with  circulation)  on  the  regular  grid  defined  by  the  im¬ 
age  coordinates.  We  can  require  that  the  correlation 
value  of  the  original  pattern  with  its  shifted  versions  be 
1  when  there  is  no  shift  and  0  for  every  non  null  shift. 
This  corresponds  to  a  filter  whose  response  is  given  by 
c  =  (0, . . . ,  0, 1,  0, . .  .0)  as  previously  described. 

6  Other  projection  approaches 

The  whole  idea  of  projection  Synthetic  Discriminant 
Functions  is  to  find  a  direction  onto  which  the  pro- 


Figure  9:  Computing  the  distance  from  linear  subspace 
(a)  versus  computing  the  distance  from  a  single  proto¬ 
type  (b).  In  drawing  (a),  vector  <  <j)  >  represents  the 
average  pattern  and  the  horizontal  line  on  which  lies 
represents  the  linear  subspace.  <j)j^  is  the  projection  of 
pattern  (f)  on  the  linear  space  and  is  the  projection 
residual.  Drawing  (b)  shows  two  vectors  and  ^2  with 
the  same  distance  from  the  average  pattern  A  but  dif¬ 
ferent  distances  from  the  pattern  space. 


jections  of  the  different  signals  have  predefined  val¬ 
ues.  A  typical  image  with  256  x  256  pixels  is  pro¬ 
jected,  for  recognition  purposes,  onto  a  single  direction 
in  this  high  dimensional  space.  Another  approach  is  to 
project  the  signal  to  be  recognized  onto  a  linear  subspace 
[32,  27,  22,  29,  30,  31],  Let  us  first  assume  that  the  pat¬ 
terns  of  each  of  the  classes  to  be  discriminated  belongs  to 
different  linear  subspaces.  For  each  class  it  is  then  pos¬ 
sible  to  determine  an  orthogonal  transformation  which 
diagonalizes  the  covariance  matrix.  The  elements  of  the 
transformed  basis  are  the  eigenvectors  of  the  covariance 
matrix  and  are  called  principal  components.  They  can 
be  sorted  by  decreasing  contribution  to  the  covariance 
matrix,  as  represented  by  the  corresponding  entry  in  the 
diagonal  covariance  matrix  [12].  The  number  of  vectors 
in  the  basis  is  equal  to  the  minimum  between  the  number 
of  available  class  pattern  and  the  dimensionality  of  the 
embedding  space.  Each  pattern  in  the  class  can  usually 
be  described  by  using  only  the  most  important  compo¬ 
nents.  The  resulting  restricted  basis  spans  a  linear  sub¬ 
space  in  which  the  patterns  of  the  represented  class  can 
be  found.  Each  possible  pattern  <j)  can  be  projected  onto 
the  set  of  principal  components  and  can  be  described  as 
the  sum  of  its  projection  plus  an  orthogonal  residual 

<t>  =  <  <l>i  >  (50) 

where  i  identifies  the  class  and  <  <f)i  >  is  the  correspond¬ 
ing  centroid.  A  comparison  with  the  usual  technique  of 
computing  the  distance  from  a  single  pattern  (e.g,  the 
centroid)  is  reported  in  Figure  9. 

An  important  class  of  objects  spanning  a  linear  space 
is  given  by  the  ortographic  projections  of  rigid  sets  of 
points  when  looked  at  from  different  positions[l,  23]. 
Different  objects  span  different  6-dimensional  linear 
spaces.  This  can  be  used  to  recognize  them,  irrespective 
of  their  orientation  in  space,  by  computing  the  magni- 
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Figure  10:  The  square  of  coordinates  ij  represents  the 
average  value  of  distances  of  views  of  the  ?-th  and 
j-th  clip  (darker  values  represent  smaller  distances). 
LEFT:  euclidean  distances  of  views  of  the  different  clips; 
RIGHT:  distances  computed  using  the  learned  metric 
W^Wo 


□  Eigenvalues 


Fi^re  11:  Eigenvalues  of  the  learned  metric  matrix 
W^W.  Note  that  there  is  compatibility  with  the  find¬ 
ings  of  Basri-Ullman  that  under  ortographic  projection 
the  rank  of  the  metric  is  6. 

tude  of  the  projection  residual  over  the  individual  lin¬ 
ear  spaces  (see  Figure  9).  Under  perspective  projection, 
when  viewing  an  object  from  a  reasonable  distance,  we 
expect  that  a  6-dimensional  linear  space  can  still  provide 
a  good  approximation  to  the  real  manifold.  Further  anal¬ 
ysis  of  the  recognition  experiments  reported  in  [5]  has 
shown  that  an  HyperBF  network  [25]  with  a  single  unit 
is  in  fact  able  to  learn  the  approximating  linear  space 
from  a  set  of  example  views  of  different  objects.  The 
experiments  used  paper  clips  characterized  by  6  feature 
points  in  the  image  plane,  resulting  in  12-dimensional 
vectors  after  perspective  projection.  The  i-th  clip  was 
characterized  by  a  one-unit  HyperBF  network: 


=  expi-{<j>  -  tifwT  -  ti))  (51) 

where  (j)  is  the  12- dimensional  input  to  the  network,  ti  is 
a  sample  view  (a  prototype)  of  the  i-th  clip  and  W^Wi 
represents  a  metric.  The  network  is  trained  by  modifying 
ti  and  W^Wi  to  obtain  Ci{<f>)  ^  1  when  ^  is  a  view  of 
the  i-th  clip  and  Ci{(l))  «  0  when  it  is  not.  The  effects 
of  the  resulting  metric  WfWi  on  the  computation  of 
distances  between  different  views  of  the  clips  can  be  seen 


in  Figure  10.  The  distance  computed  using  the  learned 
metric  is  effectively  the  size  of  the  projection  residual. 
The  eigenvalues  of  W'fWi  (see  Figure  11)  are  compatible 
with  a  6-dimensional  embedding  of  the  pattern  space. 

If  the  linear  subspace  is  the  one  spanned  by  the  first 
k  eigenvectors  of  the  covariance  matrix,  the  sum  of  the 
eigenvalues  corresponding  to  the  ignored  components 
can  be  used  as  an  estimate  for  when  the  pattern 
belongs  to  the  given  class.  In  particular  it  can  be  used 
to  accept  or  reject  the  pattern  according  to  a  threshold 
on  the  size  of  the  residual 

\<t>R?  <  6  (52) 

where 

(53) 

i>k 

and  /?  >  1  is  an  heuristic  factor  taking  into  account  how 
good  an  estimate  Xi  is  of  the  residual  error.  In  Fig¬ 
ure  12  we  report  the  fraction  of  image  pixels  classified 
as  right  eye  as  a  function  of  the  threshold  on  the  resid¬ 
ual.  The  fact  that  the  residual  is  small  (compared  to 
(5)  does  not  imply  that  the  pattern  belongs  to  the  given 
class.  Thresholding  on  the  residual  error  should  then 
be  supplemented  by  the  use  of  classification  techniques 
in  each  of  the  linear  subspaces,  taking  into  account  the 
distributions  of  the  patterns.  If,  for  instance,  the  distri¬ 
bution  of  the  points  in  the  linear  subspace  is  Gaussian, 
the  parameters  of  the  distribution  can  be  computed  and 
the  probability  of  a  pattern  with  given  coordinates  esti¬ 
mated  (see  Figure  13  where  an  example  is  reported).  If 
we  denote  by  xi  the  z-th  component  of  (j>  the  following 
relation  holds  if  the  distribution  in  the  feature  space  is 
Gaussian: 

V  oc  (54) 

where  n  is  the  number  of  patterns  used  for  computing 
the  principal  components.  The  resulting  map  can  be 
used  in  conjunction  with  the  distance  map  (see  Figure 
14)  to  establish  if  a  pattern  of  the  correct  class  is  present 
(for  a  similar  approach,  using  the  distance  from  the  cen¬ 
troid  in  the  projection  space  [29,  30,  31]).  Note  that  in 
this  particular  case  the  probability  map  is  much  more 
effective  than  the  residual  distance  map. 

It  could  be  that  a  class  cannot  be  packed  tightly  into 
a  linear  subspace.  A  possible  improvement  is  to  attempt 
local  expansions  [27,  22,  31].  Points  can  be  clustered,  and 
for  each  of  the  resulting  clusters  a  principal  component 
analysis  can  be  attempted.  The  previous  reeisonings  can 
be  applied  and  the  class  is  represented  by  a  set  of  linear 
subspaces.  A  nice  application  of  this  approach  can  be 
found  in  [22]  where  the  space  spanned  by  faces  is  first 
clustered  into  views  corresponding  to  different  poses  and 
the  resulting  clusters  are  then  described  by  the  most 
important  principal  components. 

7  An  experimental  comparison 

In  order  to  clarify  the  practical  relevance  and  the  relative 
merits  of  the  previous  template-matching  techniques,  it 
is  useful  to  compare  them  on  a  single  task.  We  choose  to 
assess  the  performance  of  the  different  techniques  on  the 
problem  of  locating  eyes  in  frontal  images  of  faces.  This 


Figure  12:  The  fraction  of  image  pixels  classified  as  right 
eye  as  a  function  of  the  threshold  on  the  residual.  The 
first  10  eigenvectors  from  a  population  of  60  images  were 
used.  The  image  used  for  the  plot  was  of  a  person  not 
in  the  database.  The  vertical  line  represent  the  thresh¬ 
old  computed  by  summing  the  residual  eigenvalues.  The 
correct  eye  is  the  only  selected  region  for  d  <  11,  the 
other  eye  being  selected  next. 


Figure  13:  The  distribution  of  the  values  of  the  5-th  prin¬ 
cipal  component  computed  from  60  eyes  images.  Note 
the  clear  unimodality  of  the  distribution  which  suggests 
the  effectiveness  of  using  a  quadratic  classifier  in  the  fea¬ 
ture  subspace.  The  other  components  present  a  similar 
distribution. 


Figure  14:  The  map  of  the  residual  size  (left)  and  of  the 
projection  probability  (right).  Note  how  the  probability 
is  low  in  regions  where  the  reconstruction  is  good.  The 
darker  the  value  the  lower  the  distance  and  the  higher 
the  probability. 
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Figure  15:  Distribution  of  the  distance  values  orthogonal 
to  the  feature  space  when  projecting  onto  the  first  10 
eigenvectors. 
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Figure  16:  Distribution  of  the  distance  values  within  the 
feature  space  when  projecting  onto  the  first  10  eigenvec¬ 
tors. 


Figure  17:  Performance  of  different  strategies  based  on 
the  computation  of  principal  components.  The  hori¬ 
zontal  axis  reports  the  number  of  components  used  in 
the  expansion,  while  the  vertical  axis  reports  the  per¬ 
centage  of  eyes  correctly  located  (see  text  for  a  detailed 
explanation). 
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Figure  18:  Performance  of  least  squares  SDF  with  dif¬ 
ferent  amount  of  regularizing  noise.  Correlation  perfor¬ 
mance  is  also  reported  using  the  average  template  and 
the  whole  set  of  available  templates.  The  horizontal 
axis  reports  the  number  of  patterns  used  in  building  the 
filters,  while  the  vertical  axis  reports  the  percentage  of 
eyes  correctly  located  (see  text  for  a  detailed  explana¬ 
tion). 
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is  a  preliminary  step  for  identifying  the  represented  per¬ 
son  by  comparing  his/her  image  to  a  reference  database. 
The  available  database  consisted  of  180  images  (three 
images,  taken  at  different  time,  from  sixty  different  peo¬ 
ple).  The  eyes  were  manually  located  and  images  nor¬ 
malized  by  fixing  the  position  of  the  eyes  to  standard 
values.  The  resulting  normalized  images  (with  an  inte¬ 
rocular  distance  of  28  pixels)  were  used  for  the  experi¬ 
ments.  Three  different  disjoint  subsets,  each  consisting 
of  the  images  from  18  different  people  were  used  in  turn 
for  building  the  SDFs,  IsSDF  and  KL  expansions.  Per¬ 
formance  was  then  assessed  on  the  remaining  images. 

For  each  of  the  compared  strategies  and  testing  im¬ 
ages,  a  map  was  computed  reporting  at  each  pixel  the 
absolute  difference  of  the  computed  values  (residual,  cor¬ 
relation,  etc.)  from  the  required  values  at  the  pattern 
(i.e.  eye)  location  (e.g.  0  for  the  residual,  1  for  corre¬ 
lation).  The  resulting  maps  could  then  be  considered 
as  distance  maps.  For  each  image  we  masked  in  turn 
the  region  of  the  left  and  right  eye.  The  unmasked  eye 
was  considered  to  be  located  correctly  if  the  smallest  dis¬ 
tance  value  was  within  8  pixels  from  the  correct  location 
(manually  detected). 

As  far  as  the  SDFs  and  IsSDF  are  concerned,  a  single 
image  from  the  represented  persons  was  used  in  building 
the  Mters,  while  the  computation  of  the  KL  components 
relied  on  all  the  available  images  in  the  training  subset. 
For  all  of  the  techniques  the  test  was  run  on  120  images. 
Both  left  and  right  eyes  were  used  in  building  the  filters 
and  the  expansions.  In  order  to  assess  the  performance 
of  the  techniques,  each  image  was  used  to  locate  both 
eyes  by  masking  in  turn  the  left  and  right  eye  region 
when  looking  for  the  maximum/minimum  values  ideally 
associated  to  the  template  location. 

Several  variants  of  the  KL  approach  have  been  in¬ 
vestigated  using  distances  from  and  within  the  feature 
space.  The  external  distance  de  is  simply  the  error  in  the 
reconstruction  of  the  pattern  using  the  restricted  eigen¬ 
vector  basis  (see  eqn.  50).  The  internal  distance  di  is 
the  distance  computed  within  the  linear  subspace  from 
its  origin  (the  centroid  of  the  patterns).  The  spherical 
internal  distance  dg.  is  the  Mahalanobis  distance  in  the 
linear  subspace. 

Let  us  assume  that  the  orthogonal  vectors  defining 
the  linear  pattern  subspace  are  known  or  computed  reli¬ 
ably  from  a  subset  of  the  available  examples.  We  could 
then  estimate  the  Mahalanobis  distance  by  computing 
the  variance  for  each  of  the  (uncorrelated)  coordinates 
using  all  the  available  samples.  Some  of  the  examples 
could  be  erroneous  or  atypical  and  would  probably  lead 
to  an  overestimated  variance.  In  order  to  overcome 
this  potential  problem,  a  robust  estimate  of  the  scale 
parameter  of  each  coordinate  was  computed  using  the 
ian-h  M-estimators  introduced  by  Hampel  [13].  Finally 
the  combined  distance  (see  also  the  related  approach  in 
[29,  30,  31])  was  computed  by  the  following  relation: 


where  the  normalizing  factors  dgo  and  deo  define  the 
points  at  which  the  cumulative  distribution  of  the  in¬ 
ternal  and  external  distances  reaches  99%  (see  Figures 


15  and  16).  The  performance  of  the  different  variants 
are  reported  in  Figure  17. 

SDFs  and  IsSDF  were  built  using  different  amounts 
of  regularizing  noise  (using  eqn.  (40))  and  of  templates. 
The  resulting  performance,  together  with  the  perfor¬ 
mance  of  standard  correlation  is  reported  in  Fig.  18.  It 
is  interesting  to  note  the  major  impact  of  the  regulariz¬ 
ing  noise  on  the  performance  of  this  technique.  However, 
the  bias  and  variance  of  the  filter  responses  on  the  test 
images  are  not  related  to  the  filter  performance. 

The  combined  distance  dc  is  the  best  among  the  com¬ 
pared  strategies.  Its  decline  in  performance  with  in¬ 
creasing  dimensionality  of  the  expansion  basis  is  linked 
to  the  trend  of  the  external  distance  performance.  By 
using  more  and  more  eigenvectors  we  allow  for  good 
reconstruction  of  patterns  different  from  eyes.  At  the 
same  time  the  scaling  factor  computed  from  the  dis¬ 
tance  distribution  on  the  training  samples  becomes  very 
small  (should  we  use  all  of  the  computed  eigenvectors 
the  samples  could  be  reconstructed  exactly).  Therefore 
the  external  distance  is  the  (wrong)  dominating  factor 
in  eqn. (55).  A  more  sophisticated  integration  is  pre¬ 
sented  in  [31].  The  performance  of  the  template  match¬ 
ing  strategies  based  on  KL  expansions  is  consistently 
higher  than  the  one  achieved  by  SDFs  in  the  reported 
variants.  Also,  expanding  a  pattern  onto  an  appropri¬ 
ate  basis  seems  to  provide  reliable  template  matching  to 
patterns  which  span  a  manifold  which  can  be  approxi¬ 
mated  well  (at  least  locally)  by  a  linear  (tangent)  space 
[1,24,23,5]. 

The  next  section  will  introduce  non  linear  machinery 
(sigmoidal  and  Gaussian  network)  for  the  purposes  of 
pattern  description  and  classification. 

8  Future  Directions:  Learning  and  SDF 

The  description  of  the  advanced  SDFs  has  shown  that 
that  they  can  be  considered  as  standard  SDF  working 
on  a  preprocessed  signal.  The  characteristics  of  the  orig¬ 
inal  signal  and  noise  are  used  in  the  synthesis  of  the 
preprocessing  filter  to  achieve  optimal  sharpness  in  the 
correlator  response.  If  we  look  at  the  patterns  in  the 
transformed  space,  the  correlator  output  is  a  weighted 
average  of  the  correlation  with  a  set  of  examples: 

(i)'{x)®h'{x)=  (56) 

where  the  prime  refers  to  the  transformed  space.  The 
patterns  can  be  randomly  chosen  among  the  avail¬ 
able  examples  or  selected  according  to  particular  criteria. 
A  possible  strategy  is  to  synthesize  the  filter  incremen¬ 
tally:  the  response  of  the  filter  on  all  the  training  images 
not  yet  used  to  build  the  filter  is  computed  and  if  the 
worst  filter  response  is  not  acceptable  the  corresponding 
image  is  added  to  the  building  set  and  a  new  filter  is 
computed  [7].  The  construction  of  the  filter,  apart  from 
the  phase  of  selecting  meaningful  training  images  is  lin¬ 
ear.  An  improvement  is  expected  with  the  introduction 
of  nonlinearity  in  the  filter  design.  We  propose  the  use  of 
approximation  networks  [25]  to  build  general  non  linear 
filters  which  are  able  to  discriminate  patterns  of  differ- 
ent  classes  while  giving  the  same  response  on  patterns 


of  the  same  class.  These  filters  can  be  considered  as  a 
generalization  of  the  projections  Synthetic  Discriminant 
Functions.  They  are  built  using  a  set  of  training  images 
and  a  set  of  soft  (i.e.  not  exactly  met)  constraints  on  the 
filter  output. 

The  general  structure  of  the  network  is  reported  in 
Figure  19.  The  units  of  the  first  level  represent  sigmoidal 
(comparison  by  projection)  or  Gaussians  (comparison  by 
distance)  functions: 

_  /  exp{—{<l}  —  —  ti))  Gaussian 

ou{(p)  —  I  sigmoidal 

(57) 

In  both  cases,  the  system  is  able  to  mask  regions  of  the 
templates  which  are  not  useful  for  obtaining  the  required 
output  values^.  The  first  level  of  the  network  can  be  seen 
as  computing  some  “optimal  templates”  against  which 
the  input  signal  is  to  be  compared.  The  output  of  the 
second  level  is  computed  as: 

^tj02j(oi)  (58) 

i 

and  the  function  implemented  by  unit  02j  can  be  of 
the  Gaussian  or  sigmoidal  type  independently  from  the 
choice  of  the  first  layer  units.  The  second  level  computes 
a  non  linear  mapping  of  the  projections  (or  distances)  of 
the  signal  by  minimizing  the  square  error  of  the  net¬ 
work  on  the  mapping  constraints  (soft,  as  they  are  not 
met  exactly).  In  some  sense,  the  network  triangulates 
the  position  of  the  input  signal  in  pattern  space  using 
the  distances  from  automatically  selected  reference  tem¬ 
plates.  The  resulting  networks  have  a  very  high  number 
of  free  parameters  and  their  training  presents  difficulties. 
Among  them  two  are  of  particular  concern:  overfitting 
and  training  time  in  a  high  dimensional  space.  A  way 
of  coping  with  the  first  one  is  that  of  cross-validation 
[12]:  the  network  undergoes  training  as  long  as  its  per¬ 
formance  on  a  test  set  improves.  We  propose  to  reduce 
the  effects  of  high  dimensionality  by  using  a  hierarchy 
of  networks  of  similar  structure  but  working  at  increas¬ 
ing  resolution.  The  network  with  the  lowest  resolution 
is  trained  first  and  extensively.  The  next  network  in  the 
hierarchy  is  then  initialized  by  suitably  mapping  the  pa¬ 
rameters  of  the  previous  one.  Note  that  only  the  first 
level  needs  to  be  modified  structurally.  A  reduced  train¬ 
ing  time  is  expected.  The  procedure  is  iterated  at  all  the 
levels  of  the  hierarchy.  A  side  effect  of  the  hierarchical 
training  is  to  provide  fully  trained  networks  for  different 
resolutions,  enabling  a  hierarchical  approach  to  template 
matching.  The  preprocessing  stage  of  the  network  is  the 
one  computed  for  the  synthesis  of  the  ASDF.  The  op¬ 
timal  templates  used  by  the  first  layer  of  the  network 
can  also  be  initialized  using  the  building  patterns  of  the 
linear  filter. 

^This  is  achieved  during  the  training  phase  by  modifica¬ 
tion  of  the  entries  of  the  matrix  W,  if  Gaussians  are  used,  or  U 
if  sigmoids  are  used.  Relatively  small  values  give  low  weight 
to  the  differences  of  the  corresponding  coordinates,  thereby 
making  the  system  output  weakly  dependent  on  them. 


Preprocessing 


Figure  19:  An  approximation  network  for  template 
matching.  The  preprocessing  stage  applies  simple  trans¬ 
formation  to  the  input  pattern  (e.g.  to  emphasize  high 
frequency  components). 

9  Conclusions 

Several  approaches  to  template  matching  have  been  re¬ 
viewed  and  compared  on  a  common  task.  A  new  vari¬ 
ant  of  Synthetic  Discriminant  Functions,  based  on  least 
square  estimation,  was  introduced.  Several  template 
matching  techniques  based  on  the  expansion  of  pat¬ 
terns  on  principal  components  have  been  reviewed.  A 
simple  way  of  integrating  internal/external  distances 
within/from  a  linear  feature  space  was  also  proposed. 
Several  of  the  techniques  mentioned  in  the  paper  have 
been  compared  on  a  common  task:  locating  eyes  in 
frontal  images  of  different  people.  The  techniques  based 
on  pattern  expansion  provide  superior  performance,  at 
least  in  the  particular  task  considered.  Finally,  a  two 
layer  approximation  network  has  been  proposed  to  gen¬ 
eralize  the  structure  of  SDF  to  a  nonlinear  filter.  Future 
work  will  explore  the  advantages  and  difficulties  of  the 
introduction  of  nonlinearity. 
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